{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Recipe Search using Model2Vec**\n",
    "\n",
    "This notebook demonstrates how to use the Model2Vec library to search for recipes based on a given query. We will use the [recipe dataset](https://huggingface.co/datasets/Shengtao/recipe).\n",
    "We will be using the `model2vec` in different modes to search for recipes based on a query, using both our own pre-trained models, as well as a domain-specific model we will distill ourselves in this tutorial.\n",
    "\n",
    "Three modes of Model2Vec use are demonstrated:\n",
    "1. **Using a pre-trained output vocab model**: Uses a pre-trained output embedding model. This is a very small model that uses a subword tokenizer. \n",
    "2. **Using a pre-trained glove vocab model**: Uses pre-trained glove vocab model. This is a larger model that uses a word tokenizer.\n",
    "3. **Using a custom vocab model**: Uses a custom domain-specific vocab model that is distilled on a vocab created from the recipe dataset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install the necessary libraries\n",
    "!pip install numpy datasets scikit-learn transformers model2vec\n",
    "    \n",
    "# Import the necessary libraries\n",
    "import regex\n",
    "from collections import Counter\n",
    "\n",
    "import numpy as np\n",
    "from datasets import load_dataset\n",
    "from sklearn.metrics import pairwise_distances\n",
    "from tokenizers.pre_tokenizers import Whitespace\n",
    "\n",
    "from model2vec import StaticModel\n",
    "from model2vec.distill import distill"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the recipe dataset\n",
    "dataset = load_dataset(\"Shengtao/recipe\", split=\"train\")\n",
    "# Convert the dataset to a pandas DataFrame\n",
    "dataset = dataset.to_pandas()\n",
    "# Take the title column as our recipes corpus\n",
    "recipes = dataset[\"title\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>title</th>\n",
       "      <th>category</th>\n",
       "      <th>description</th>\n",
       "      <th>ingredients</th>\n",
       "      <th>directions</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Simple Macaroni and Cheese</td>\n",
       "      <td>main-dish</td>\n",
       "      <td>A very quick and easy fix to a tasty side-dish...</td>\n",
       "      <td>1 (8 ounce) box elbow macaroni ; ¼ cup butter ...</td>\n",
       "      <td>Bring a large pot of lightly salted water to a...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Gourmet Mushroom Risotto</td>\n",
       "      <td>main-dish</td>\n",
       "      <td>Authentic Italian-style risotto cooked the slo...</td>\n",
       "      <td>6 cups chicken broth, divided ; 3 tablespoons ...</td>\n",
       "      <td>In a saucepan, warm the broth over low heat. W...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Dessert Crepes</td>\n",
       "      <td>breakfast-and-brunch</td>\n",
       "      <td>Essential crepe recipe.  Sprinkle warm crepes ...</td>\n",
       "      <td>4  eggs, lightly beaten ; 1 ⅓ cups milk ; 2 ta...</td>\n",
       "      <td>In large bowl, whisk together eggs, milk, melt...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Pork Steaks</td>\n",
       "      <td>meat-and-poultry</td>\n",
       "      <td>My mom came up with this recipe when I was a c...</td>\n",
       "      <td>¼ cup butter ; ¼ cup soy sauce ; 1 bunch green...</td>\n",
       "      <td>Melt butter in a skillet, and mix in the soy s...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Quick and Easy Pizza Crust</td>\n",
       "      <td>bread</td>\n",
       "      <td>This is a great recipe when you don't want to ...</td>\n",
       "      <td>1 (.25 ounce) package active dry yeast ; 1 tea...</td>\n",
       "      <td>Preheat oven to 450 degrees F (230 degrees C)....</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        title              category  \\\n",
       "0  Simple Macaroni and Cheese             main-dish   \n",
       "1    Gourmet Mushroom Risotto             main-dish   \n",
       "2              Dessert Crepes  breakfast-and-brunch   \n",
       "3                 Pork Steaks      meat-and-poultry   \n",
       "4  Quick and Easy Pizza Crust                 bread   \n",
       "\n",
       "                                         description  \\\n",
       "0  A very quick and easy fix to a tasty side-dish...   \n",
       "1  Authentic Italian-style risotto cooked the slo...   \n",
       "2  Essential crepe recipe.  Sprinkle warm crepes ...   \n",
       "3  My mom came up with this recipe when I was a c...   \n",
       "4  This is a great recipe when you don't want to ...   \n",
       "\n",
       "                                         ingredients  \\\n",
       "0  1 (8 ounce) box elbow macaroni ; ¼ cup butter ...   \n",
       "1  6 cups chicken broth, divided ; 3 tablespoons ...   \n",
       "2  4  eggs, lightly beaten ; 1 ⅓ cups milk ; 2 ta...   \n",
       "3  ¼ cup butter ; ¼ cup soy sauce ; 1 bunch green...   \n",
       "4  1 (.25 ounce) package active dry yeast ; 1 tea...   \n",
       "\n",
       "                                          directions  \n",
       "0  Bring a large pot of lightly salted water to a...  \n",
       "1  In a saucepan, warm the broth over low heat. W...  \n",
       "2  In large bowl, whisk together eggs, milk, melt...  \n",
       "3  Melt butter in a skillet, and mix in the soy s...  \n",
       "4  Preheat oven to 450 degrees F (230 degrees C)....  "
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Display the first few rows of the dataset for the specified columns\n",
    "dataset[[\"title\", \"category\", \"description\", \"ingredients\", \"directions\"]].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we will set up a function to handle similarity search that we can use in this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define a function to find the most similar titles in a dataset to a given query\n",
    "def find_most_similar_items(model: StaticModel, embeddings: np.ndarray, query: str, top_k=5) -> list[tuple[int, float]]:\n",
    "    \"\"\"\n",
    "    Finds the most similar items in a dataset to the given query using the specified model.\n",
    "\n",
    "    :param model: The model used to generate embeddings.\n",
    "    :param embeddings: The embeddings of the dataset.\n",
    "    :param query: The query recipe title.\n",
    "    :param top_k: The number of most similar titles to return.\n",
    "    :return: A list of tuples containing the most similar titles and their cosine similarity scores.\n",
    "    \"\"\"\n",
    "    # Generate embedding for the query\n",
    "    query_embedding = model.encode(query)[None, :]\n",
    "\n",
    "    # Calculate pairwise cosine distances between the query and the precomputed embeddings\n",
    "    distances = pairwise_distances(query_embedding, embeddings, metric='cosine')[0]\n",
    "\n",
    "    # Get the indices of the most similar items (sorted in ascending order because smaller distances are better)\n",
    "    most_similar_indices = np.argsort(distances)\n",
    "\n",
    "    # Convert distances to similarity scores (cosine similarity = 1 - cosine distance)\n",
    "    most_similar_scores = [1 - distances[i] for i in most_similar_indices[:top_k]]\n",
    "\n",
    "    # Return the top-k most similar indices and similarity scores\n",
    "    return list(zip(most_similar_indices[:top_k], most_similar_scores))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Using a pre-trained output vocab model**\n",
    "\n",
    "In this part, we will use a pre-trained output vocab model to encode the recipes and search using multiple queries. The output vocab model is very small and fast while still providing good results. Since the model uses a sub-word tokenizer, it is able to handle out-of-vocabulary words and provide good results even for words that are not in the base vocab."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the M2V output model from the HuggingFace hub\n",
    "model_name = \"minishlab/M2V_base_output\"\n",
    "model_output = StaticModel.from_pretrained(model_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Most similar recipes to 'cheeseburger':\n",
      "Title: `Double Cheeseburger`, Similarity Score: 0.9028\n",
      "Title: `Cheeseburger Chowder`, Similarity Score: 0.8574\n",
      "Title: `Cheeseburger Sliders`, Similarity Score: 0.8413\n",
      "Title: `Cheeseburger Salad`, Similarity Score: 0.8384\n",
      "Title: `Cheeseburger Soup I`, Similarity Score: 0.8298\n",
      "\n",
      "Most similar recipes to 'fattoush':\n",
      "Title: `Fattoush`, Similarity Score: 1.0000\n",
      "Title: `Lebanese Fattoush`, Similarity Score: 0.8370\n",
      "Title: `Aunty Terese's Fattoush`, Similarity Score: 0.7630\n",
      "Title: `Arabic Fattoush Salad`, Similarity Score: 0.7588\n",
      "Title: `Authentic Lebanese Fattoush`, Similarity Score: 0.7584\n"
     ]
    }
   ],
   "source": [
    "# Find recipes using the output embeddings model\n",
    "top_k = 5\n",
    "\n",
    "# Find the most similar recipes to the given queries\n",
    "query = \"cheeseburger\"\n",
    "embeddings = model_output.encode(recipes)\n",
    "\n",
    "results = find_most_similar_items(model_output, embeddings, query, top_k)\n",
    "print(f\"Most similar recipes to '{query}':\")\n",
    "for idx, score in results:\n",
    "    print(f\"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}\")\n",
    "    \n",
    "print()\n",
    "\n",
    "query = \"fattoush\"\n",
    "results = find_most_similar_items(model_output, embeddings, query, top_k)\n",
    "print(f\"Most similar recipes to '{query}':\")\n",
    "for idx, score in results:\n",
    "    print(f\"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}\")\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As can be seen, we get some good results for the queries. The model is able to find recipes that are similar to the query."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Using a pre-trained output vocab model**\n",
    "\n",
    "In this part, we will use a pre-trained glove vocab model to encode the recipes and search using multiple queries. The glove vocab model is a bit larger and slower than the output vocab model but can provide better results. However, as we will see, it suffers from the out-of-vocabulary problem, since the glove vocab is not designed for the cooking recipe domain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the M2V glove model from the HuggingFace hub\n",
    "model_name = \"minishlab/M2V_base_glove\"\n",
    "model_glove = StaticModel.from_pretrained(model_name)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 92,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Most similar recipes to 'cheeseburger':\n",
      "Title: `Double Cheeseburger`, Similarity Score: 0.8744\n",
      "Title: `Cheeseburger Meatloaf`, Similarity Score: 0.8246\n",
      "Title: `Cheeseburger Salad`, Similarity Score: 0.8160\n",
      "Title: `Hearty American Cheeseburger`, Similarity Score: 0.8006\n",
      "Title: `Cheeseburger Chowder`, Similarity Score: 0.7989\n",
      "\n",
      "Most similar recipes to 'fattoush':\n",
      "Title: `Simple Macaroni and Cheese`, Similarity Score: 0.0000\n",
      "Title: `Fresh Tomato and Cucumber Salad with Buttery Garlic Croutons`, Similarity Score: 0.0000\n",
      "Title: `Grilled Cheese, Apple, and Thyme Sandwich`, Similarity Score: 0.0000\n",
      "Title: `Poppin' Turkey Salad`, Similarity Score: 0.0000\n",
      "Title: `Chili - The Heat is On!`, Similarity Score: 0.0000\n"
     ]
    }
   ],
   "source": [
    "# Find recipes using the output embeddings model\n",
    "top_k = 5\n",
    "\n",
    "# Find the most similar recipes to the given queries\n",
    "query = \"cheeseburger\"\n",
    "embeddings = model_glove.encode(recipes)\n",
    "\n",
    "results = find_most_similar_items(model_glove, embeddings, query, top_k)\n",
    "print(f\"Most similar recipes to '{query}':\")\n",
    "for idx, score in results:\n",
    "    print(f\"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}\")\n",
    "    \n",
    "print()\n",
    "\n",
    "query = \"fattoush\"\n",
    "results = find_most_similar_items(model_glove, embeddings, query, top_k)\n",
    "print(f\"Most similar recipes to '{query}':\")\n",
    "for idx, score in results:\n",
    "    print(f\"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}\")\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As can be seen, we get good results when we search for an in vocab query (`cheeseburger`), but when we search for an out-of-vocab query (`fattoush`), the model is not able to find any relevant recipes. To fix this, we will now distill a custom vocab model on the recipe dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Using a custom vocab model**\n",
    "\n",
    "In this part, we will distill a custom vocab model on the recipe dataset and use it to encode the recipes and search using multiple queries. This will create a domain-specific model2vec model. First, we will set up a function to create a vocabulary from a list of texts (in our case, a list of recipe titles)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set up a regex tokenizer to split texts into words and punctuation\n",
    "my_regex = regex.compile(r\"\\w+|[^\\w\\s]+\")\n",
    "\n",
    "def create_vocab(texts: list[str], tokenizer: Whitespace, size: int = 30_000) -> list[str]:\n",
    "    \"\"\"\n",
    "    Create a vocab from a list of texts.\n",
    "    \n",
    "    :param texts: A list of texts.\n",
    "    :param tokenizer: A whitespace tokenizer.\n",
    "    :param size: The size of the vocab.\n",
    "    :return: A vocab sorted by frequency.\n",
    "    \"\"\"\n",
    "    counts = Counter()\n",
    "    for text in texts:\n",
    "        tokens = tokenizer.pre_tokenize_str(text.lower())\n",
    "        tokens = [token for token, _ in tokens]\n",
    "        counts.update(tokens)\n",
    "    vocab = [word for word, _ in counts.most_common(size)]\n",
    "    return vocab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 8/8 [00:08<00:00,  1.04s/it]\n"
     ]
    }
   ],
   "source": [
    "# Choose a Sentence Transformer model and a tokenizer\n",
    "model_name = \"BAAI/bge-small-en-v1.5\"\n",
    "tokenizer = Whitespace()\n",
    "\n",
    "# Create a custom vocab from the recipe titles\n",
    "vocab = create_vocab(recipes, tokenizer)\n",
    "\n",
    "# Distill a model2vec model using the Sentence Transformer model and the custom vocab\n",
    "model_custom = distill(model_name=model_name, vocabulary=vocab, pca_dims=256)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Most similar recipes to 'cheeseburger':\n",
      "Title: `Cheeseburger Salad`, Similarity Score: 0.9528\n",
      "Title: `Cheeseburger Casserole`, Similarity Score: 0.9030\n",
      "Title: `Cheeseburger Chowder`, Similarity Score: 0.8635\n",
      "Title: `Cheeseburger Pie`, Similarity Score: 0.8401\n",
      "Title: `Cheeseburger Meatloaf`, Similarity Score: 0.8184\n",
      "\n",
      "Most similar recipes to 'fattoush':\n",
      "Title: `Fattoush`, Similarity Score: 1.0000\n",
      "Title: `Fatoosh`, Similarity Score: 0.7488\n",
      "Title: `Lebanese Fattoush`, Similarity Score: 0.6344\n",
      "Title: `Arabic Fattoush Salad`, Similarity Score: 0.6108\n",
      "Title: `Fattoush (Lebanese Salad)`, Similarity Score: 0.5669\n"
     ]
    }
   ],
   "source": [
    "# Find recipes using the output embeddings model\n",
    "top_k = 5\n",
    "\n",
    "# Find the most similar recipes to the given queries\n",
    "query = \"cheeseburger\"\n",
    "embeddings = model_custom.encode(recipes)\n",
    "\n",
    "results = find_most_similar_items(model_custom, embeddings, query, top_k)\n",
    "print(f\"Most similar recipes to '{query}':\")\n",
    "for idx, score in results:\n",
    "    print(f\"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}\")\n",
    "    \n",
    "print()\n",
    "\n",
    "query = \"fattoush\"\n",
    "results = find_most_similar_items(model_custom, embeddings, query, top_k)\n",
    "print(f\"Most similar recipes to '{query}':\")\n",
    "for idx, score in results:\n",
    "    print(f\"Title: `{recipes[idx]}`, Similarity Score: {score:.4f}\")\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As can be seen, we now get good results for both queries with our custom vocab model since the domain-specific terms are included in the vocab."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
